```r
# Load soil data from sampling locations
bern_data <- readr::read_csv(
  here::here("data-raw/soildata/berne_soil_sampling_locations.csv")
)

# Display data
head(bern_data) |> knitr::kable()
```
| site_id_unique | timeset | x | y | dataset | dclass | waterlog.30 | waterlog.50 | waterlog.100 | ph.0.10 | ph.10.30 | ph.30.50 | ph.50.100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4_26-In-005 | d1968_1974_ptf | 2571994 | 1203001 | validation | poor | 0 | 0 | 1 | 6.071733 | 6.227780 | 7.109235 | 7.214589 |
| 4_26-In-006 | d1974_1978 | 2572149 | 1202965 | calibration | poor | 0 | 1 | 1 | 6.900000 | 6.947128 | 7.203502 | 7.700000 |
| 4_26-In-012 | d1974_1978 | 2572937 | 1203693 | calibration | moderate | 0 | 1 | 1 | 6.200000 | 6.147128 | 5.603502 | 5.904355 |
| 4_26-In-014 | d1974_1978 | 2573374 | 1203710 | validation | well | 0 | 0 | 0 | 6.600000 | 6.754607 | 7.200000 | 7.151129 |
| 4_26-In-015 | d1968_1974_ptf | 2573553 | 1203935 | validation | moderate | 0 | 0 | 1 | 6.272715 | 6.272715 | 6.718392 | 7.269008 |
| 4_26-In-016 | d1968_1974_ptf | 2573310 | 1204328 | calibration | poor | 0 | 0 | 1 | 6.272715 | 6.160700 | 5.559031 | 5.161655 |
The dataset on soil samples from Bern holds 13 variables for 1052 entries (more information here):

- `site_id_unique`: The location's unique site id.
- `timeset`: The sampling year and information on the sampling type for soil pH (no label: CaCl\(_2\) laboratory measurement; `field`: indicator solution used in the field; `ptf`: H\(_2\)O laboratory measurement transferred by a pedotransfer function).
- `x`: The x (easting) coordinate in meters, following the CH1903/LV03 system.
- `y`: The y (northing) coordinate in meters, following the CH1903/LV03 system.
- `dataset`: Specification of whether a sample is used for model calibration or validation (based on randomization to ensure even spatial coverage).
- `dclass`: Soil drainage class.
- `waterlog.30`, `waterlog.50`, `waterlog.100`: Specification of whether the soil was waterlogged at 30, 50, or 100 cm depth (0 = No, 1 = Yes).
- `ph.0.10`, `ph.10.30`, `ph.30.50`, `ph.50.100`: Average soil pH at 0-10, 10-30, 30-50, and 50-100 cm depth.
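As a quick sanity check of the coding described above, one can tabulate the calibration/validation split and the binary waterlog flags. A minimal sketch on a small toy dataframe standing in for `bern_data` (the toy values are made up for illustration):

```r
# Toy stand-in for bern_data, just to illustrate the checks
toy <- data.frame(
  dataset     = c("calibration", "validation", "calibration", "calibration"),
  waterlog.30 = c(0, 1, 0, 0)
)

table(toy$dataset)          # counts per dataset split
mean(toy$waterlog.30 == 1)  # share of sites waterlogged at 30 cm
```

On the real data, the same two lines reveal the calibration/validation balance and how common waterlogging is at each depth.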
### 3.1.1 Covariate data
Now, let’s load all the covariates that we need to produce our soil maps.
```r
# Get a list with the paths to all raster files
list_raster <- base::list.files(
  here::here("data-raw/geodata/covariates/"),
  full.names = TRUE
)

# Display data (lapply to clean names)
lapply(
  list_raster,
  function(x) sub(".*/(.*)", "\\1", x)
) |>
  unlist() |>
  head(5) |>
  print()
```
The output above shows the first five raster files, which have rather cryptic names. For this tutorial, we do not need to know all 91 raster files that we are loading here, but in a scientific context you should, of course, know your data better. Let’s look at one of these raster files to get a better feeling for our data:
```r
# Load a raster file as an example: picking the slope profile at 2 m resolution
raster_example <- terra::rast(list_raster[74])
raster_example
```
```
class       : SpatRaster
dimensions  : 986, 2428, 1  (nrow, ncol, nlyr)
resolution  : 20, 20  (x, y)
extent      : 2568140, 2616700, 1200740, 1220460  (xmin, xmax, ymin, ymax)
coord. ref. : CH1903+ / LV95
source      : Se_slope2m.tif
name        : Se_slope2m
min value   :  0.00000
max value   : 85.11286
```
As shown in the output, a raster object has the following properties (among others; see `?terra::rast`):

- `class`: The class of the object, here a `SpatRaster`.
- `dimensions`: The number of rows, columns, and layers (layers can encode, for example, time steps).
- `resolution`: The cell size in units of the coordinate system; here it is 20 m along both axes.
- `extent`: The extent of the raster, defined by the min and max values on the x and y axes.
- `coord. ref.`: The reference coordinate system. Here, the raster is encoded using the LV95 geodetic reference system, from which the projected coordinate system CH1903+ is derived.
- `source`: The name of the source file.
- `name`: The name of the raster layer (mostly the file name without the file extension).
- `min value`: The lowest value of all cells.
- `max value`: The highest value of all cells.
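These properties can also be queried programmatically, which is handy for consistency checks in scripts. A short sketch using standard {terra} accessor functions on a small in-memory raster (the raster below is constructed here for illustration, it is not one of the tutorial's files):

```r
library(terra)

# Build a small in-memory raster so the sketch is self-contained
r <- terra::rast(
  nrows = 10, ncols = 20,
  xmin = 2568140, xmax = 2568540,
  ymin = 1200740, ymax = 1200940
)
terra::values(r) <- runif(terra::ncell(r))

terra::res(r)              # cell size along x and y, here 20, 20
terra::ext(r)              # extent: xmin, xmax, ymin, ymax
terra::crs(r)              # coordinate reference system (empty for this toy raster)
range(terra::values(r))    # min and max cell values
```

Such checks are useful before stacking rasters, since {terra} requires identical extents and resolutions for stacking.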
# 4 Mapping the study area
Now, let’s look at a visualisation of this raster file. Since we selected the slope at 2m resolution, we expect a relief-like map with a color gradient that indicates the steepness of the terrain.
```r
# Plot raster example
terra::plot(raster_example)
```
```r
# To have some more flexibility, we can plot this in the ggplot style as such:
ggplot2::ggplot() +
  tidyterra::geom_spatraster(data = raster_example) +
  ggplot2::scale_fill_viridis_c(
    na.value = NA,
    option = "magma",
    name = "Slope (%) \n"
  ) +
  ggplot2::theme_bw() +
  # avoid gap between plotting area and axis
  ggplot2::scale_x_continuous(expand = c(0, 0)) +
  ggplot2::scale_y_continuous(expand = c(0, 0)) +
  ggplot2::labs(title = "Slope of the Study Area")
```
Note
Note that the second plot uses different coordinates than the first one. That is because the data was automatically projected to the World Geodetic System (WGS84, EPSG: 4326).
This already looks interesting, but we can put our data into a bit more context. For example, a larger map background would be useful to better orient ourselves. Also, it would be nice to see where our sampling locations are and to differentiate them by whether they belong to the calibration or validation dataset. Bringing this all together requires some more understanding of plotting maps in R. So, don’t worry if you do not understand everything in the code chunk below, and enjoy the visualisations:
```r
# To get our map working correctly, we have to ensure that all the input data
# is in the same coordinate system. Since our Berne data is in the Swiss
# coordinate system, we have to transform the sampling locations to the
# World Geodetic System first.

# For the raster:
r <- terra::project(raster_example, "+init=EPSG:4326")

# For the sampling locations:
# Function from Stackoverflow:
# https://stackoverflow.com/questions/49536664/r-transforming-coordinates-inside-the-data-frame
change_coords <- function(data, from_CRS, to_CRS) {
  # Load required package
  require(sp)

  # Turn string into CRS
  from_CRS <- CRS(from_CRS)
  to_CRS   <- CRS(to_CRS)

  new <- as.data.frame(
    spTransform(
      SpatialPointsDataFrame(
        coords = data.frame(lon = data$lon, lat = data$lat),
        data = data.frame(
          id      = data$id,
          lon_old = data$lon,
          lat_old = data$lat
        ),
        proj4string = from_CRS
      ),
      to_CRS
    )
  )

  new <- new |>
    dplyr::select(coords.x1, coords.x2, id) |>
    dplyr::rename(lon = coords.x1, lat = coords.x2)

  return(new)
}

# Transform dataframes
coord_cal <- bern_data |>
  dplyr::filter(dataset == "calibration") |>
  dplyr::select(site_id_unique, x, y) |>
  dplyr::rename(id = site_id_unique, lon = x, lat = y) |>
  change_coords(from_CRS = "+init=epsg:2056", to_CRS = "+init=epsg:4326")

coord_val <- bern_data |>
  dplyr::filter(dataset == "validation") |>
  dplyr::select(site_id_unique, x, y) |>
  dplyr::rename(id = site_id_unique, lon = x, lat = y) |>
  change_coords(from_CRS = "+init=epsg:2056", to_CRS = "+init=epsg:4326")
```
```r
# Loading packages to improve code readability and avoid the :: notation
# Note: This code may only work when installing the development branch of {leaflet}:
# remotes::install_github('rstudio/leaflet')
library(leaflet)
library(terra)

# Let's get a nice color palette now for easy reference
pal <- colorNumeric(
  "magma",
  values(r),
  na.color = "transparent"
)

# Next, we build a leaflet map
leaflet() |>
  # As base maps, use two provided by ESRI
  addProviderTiles(providers$Esri.WorldImagery, group = "World Imagery") |>
  addProviderTiles(providers$Esri.WorldTopoMap, group = "World Topo") |>
  # Add our raster file
  addRasterImage(
    r,
    colors = pal,
    opacity = 0.6,
    group = "raster"
  ) |>
  # Add markers for sampling locations
  leaflet::addCircleMarkers(
    data = coord_cal,
    lng = ~lon,  # Column name for x coordinates
    lat = ~lat,  # Column name for y coordinates
    group = "training",
    color = "black"
  ) |>
  leaflet::addCircleMarkers(
    data = coord_val,
    lng = ~lon,
    lat = ~lat,
    group = "validation",
    color = "red"
  ) |>
  # Add some layout and legend
  addLayersControl(
    baseGroups = c("World Imagery", "World Topo"),
    position = "topleft",
    options = layersControlOptions(collapsed = FALSE),
    overlayGroups = c("raster", "training", "validation")
  ) |>
  addLegend(
    pal = pal,
    values = values(r),
    title = "Slope (%)"
  )
```
Note
This plotting example is based on the one shown in the AGDS 2 tutorial “Handful of Pixels” on phenology. More information on using spatial data in R can be found in the Chapter on Geospatial data in R.
That looks great! At first glance it is a bit crowded, but once you zoom in you can investigate our study area quite nicely. You can check whether the slope raster makes sense by comparing it against the base maps. Can you see how the cliffs along the Aare river, hills, and even gravel quarries show high slopes? We also see that our validation dataset is nicely distributed across the area covered by the training dataset.
Now that we have played with a few visualizations, let’s get back to preparing our data. The {terra} package comes with a very useful tool to stack multiple rasters on top of each other, provided they are of the same spatial format. To do so, we just have to feed in the vector of file names, `list_raster`:
```r
# Load all files as one batch
all_rasters <- terra::rast(list_raster)
all_rasters
```
Now, we do not want the covariate data from all cells in the raster files. Rather, we want to reduce our stacked rasters to the x and y coordinates for which we have soil sampling data. We can do this using the `terra::extract()` function. Then, we want to merge the two dataframes of soil and covariate data by their coordinates. Since the order of the covariate data is the same as that of the Bern data, we can simply bind their columns with `cbind()`:
```r
# Extract coordinates from sampling locations
sampling_xy <- bern_data |> dplyr::select(x, y)

# From all rasters, extract values for the sampling coordinates
covar_data <- terra::extract(
  all_rasters,  # The raster stack we want to extract from
  sampling_xy,  # A matrix of x and y values to extract for
  ID = FALSE    # Do not add a default ID column to the output
)

merged_data <- cbind(bern_data, covar_data)
head(merged_data) |> knitr::kable()
```
The merged dataframe now holds 104 columns: the 13 soil variables plus the 91 covariates (e.g., `be_gwn25_hdist`, `cindx10_25`, `Se_slope2m`, `vdcn25`). An excerpt of the output, with most covariate columns omitted for readability:

| site_id_unique | timeset | dataset | dclass | ph.0.10 | … | be_gwn25_hdist | be_gwn25_vdist | … |
|---|---|---|---|---|---|---|---|---|
| 4_26-In-005 | d1968_1974_ptf | validation | poor | 6.071733 | … | 234.39087 | 1.2986320 | … |
| 4_26-In-006 | d1974_1978 | calibration | poor | 6.900000 | … | 127.41681 | 1.7064546 | … |
| 4_26-In-012 | d1974_1978 | calibration | moderate | 6.200000 | … | 143.41533 | 0.9372618 | … |
| 4_26-In-014 | d1974_1978 | validation | well | 6.600000 | … | 165.80418 | 0.7653937 | … |
| 4_26-In-015 | d1968_1974_ptf | validation | moderate | 6.272715 | … | 61.39244 | 1.0676192 | … |
| 4_26-In-016 | d1968_1974_ptf | calibration | poor | 6.272715 | … | 310.05014 | 0.1321367 | … |
Great, that worked without problems!
# 5 Spatial correlation
As touched upon in Chapter 1, we should find a way to take spatial auto-correlation into account. Currently, our dataset holds spatial information in the x and y coordinates. The risk of including these coordinates is that a model learns the spatial features of our study area but not the generalisable underlying processes that drive soil development. For example, if our study area tends to have a given soil type towards the north, a model may just learn this and predict a higher probability of that soil type with increasing y coordinates. This tendency, however, may not always hold (for more, see Chapter 6).
Another consequence of just using x and y coordinates is that, when we use such a model for prediction, the final map will most likely look like a chessboard. Although it does not alleviate the issue of spatial autocorrelation, a quick fix to avoid a chessboard pattern is to add rotated coordinates to your dataset. That means that for every pixel, you rotate its x and y coordinates by a certain angle (see Figure 5.1). This allows a Random Forest model to not just split along the orthogonal x and y coordinates but also along the rotated axes. Here, we are going to rotate the coordinate system by 30° and 60° as follows:
Figure 5.1: Visualisation of rotating x- and y-coordinates by 30° and 60°. Figure adapted from work by M. Nussbaum.
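The rotation described above is plain trigonometry: each point is multiplied by a 2x2 rotation matrix. A minimal sketch (the helper name `rotate_coords` is ours, not from the tutorial):

```r
# Rotate x/y coordinates by a given angle (in degrees) around the origin.
# `rotate_coords` is a hypothetical helper for illustration.
rotate_coords <- function(x, y, angle_deg) {
  theta <- angle_deg * pi / 180
  data.frame(
    x_rot =  x * cos(theta) + y * sin(theta),
    y_rot = -x * sin(theta) + y * cos(theta)
  )
}

# Example: 30° and 60° rotated coordinates for two sampling locations
xy    <- data.frame(x = c(2571994, 2572149), y = c(1203001, 1202965))
rot30 <- rotate_coords(xy$x, xy$y, 30)
rot60 <- rotate_coords(xy$x, xy$y, 60)
```

Adding `rot30` and `rot60` as extra columns gives the Random Forest additional split axes at 30° and 60° to the original grid.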
Note that not all of our covariates are continuous variables; categorical covariates have to be encoded as factors. As an easy check, we can take the original covariate data and count the number of unique values in each raster. If a variable is continuous, we expect a lot of different values (at most 1052, because we have that many entries). So, let’s have a look and assume that variables with 10 or fewer distinct values are categorical variables.
```r
cat_vars <- covar_data |>
  # Get number of distinct values per variable
  dplyr::summarise(dplyr::across(dplyr::everything(), ~ dplyr::n_distinct(.))) |>
  # Turn df into long format for easy filtering
  tidyr::pivot_longer(
    dplyr::everything(),
    names_to = "variable",
    values_to = "n"
  ) |>
  # Keep variables with 10 or fewer distinct values
  dplyr::filter(n <= 10) |>
  # Extract the names of these variables
  dplyr::pull('variable')

cat("Variables with less than 10 distinct values:", cat_vars)
```
```
Variables with less than 10 distinct values: geo500h1id geo500h3id lgm vszone
```
Now that we have the names of the categorical variables, we can mutate these columns in our dataframe using the base function `as.factor()`:
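A minimal sketch of this step, assuming the result is stored as `final_data` (the dataframe name used in the chunks below):

```r
# Turn the categorical covariates into factors.
# Assumes `merged_data` and `cat_vars` from the chunks above.
final_data <- merged_data |>
  dplyr::mutate(dplyr::across(dplyr::all_of(cat_vars), ~ as.factor(.)))
```

Encoding these columns as factors ensures that models treat them as categories rather than as numeric magnitudes.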
We are almost done with our data preparation; we just need to reduce it to sampling locations for which we have a decent amount of data on the covariates. Otherwise, we would bloat the model calibration with data that is not informative enough.
```r
# Get number of rows to calculate percentages
n_rows <- nrow(final_data)

# Get number of non-missing values per variable
final_data |>
  dplyr::summarise(dplyr::across(dplyr::everything(), ~ length(.) - sum(is.na(.)))) |>
  tidyr::pivot_longer(
    dplyr::everything(),
    names_to = "variable",
    values_to = "n"
  ) |>
  dplyr::mutate(perc_available = round(n / n_rows * 100)) |>
  dplyr::arrange(perc_available) |>
  head(10) |>
  knitr::kable()
```
| variable | n | perc_available |
|---|---|---|
| ph.30.50 | 856 | 81 |
| ph.10.30 | 866 | 82 |
| ph.50.100 | 859 | 82 |
| timeset | 871 | 83 |
| ph.0.10 | 870 | 83 |
| dclass | 1006 | 96 |
| site_id_unique | 1052 | 100 |
| x | 1052 | 100 |
| y | 1052 | 100 |
| dataset | 1052 | 100 |
This looks good: we have no variable with a substantial amount of missing data. Generally, only the pH measurements are lacking, which we should keep in mind when making predictions and inferences. Another great way to explore your data is the {visdat} package:
```r
final_data |> visdat::vis_miss()
```
Alright, we see that we are not missing much data for any variable; in total, only 1% of our data is NA. That is good enough! We do not have to modify the dataframe any further and can save it for further analysis.
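One way to save the prepared dataframe is base R's RDS format, which preserves column types such as factors. A short sketch with a toy dataframe (the file name is just an example):

```r
# Sketch: save a prepared dataframe to RDS and read it back unchanged.
# `df` is a toy stand-in for `final_data`; the path is an example.
df   <- data.frame(id = c("a", "b"), value = c(1.5, 2.5))
path <- file.path(tempdir(), "berne_data_prepared.rds")

saveRDS(df, path)
df_check <- readRDS(path)
identical(df, df_check)  # TRUE: the round trip is lossless
```

For the real data, `saveRDS(final_data, ...)` works the same way; `readRDS()` restores it with all factor encodings intact.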